NSF PAR Search | NSF Public Access Repository

OCR-Enabled Dataset Assembly for Machine Learning in Olivine Geochemistry

https://doi.org/10.22541/essoar.176659907.72242256/v1

Wang, Ping; Pelaia, Dominick; He, Nathan; Zhu, Flora; Mao, Nathan; He, Amanda; Xiao, Stephen; Zou, Zhitong; Wu, Jinqi; Wu, Jian; et al (December 2025, ESS Open Archive)

Free, publicly-accessible full text available December 24, 2026
Rethinking Neural-based Matrix Inversion: Why can’t, and Where can

Ji, Yuliang; Wu, Jian; Xi, Yuanzhe (May 2025, PMLR)
Li, Yingzhen; Mandt, Stephan; Agrawal, Shipra; Khan, Emtiyaz (Ed.)
Free, publicly-accessible full text available May 3, 2026
Uncertainty Quantification in Table Structure Recognition

Ajayi, Kehinde; Zhang, Leizhen; He, Yi; Wu, Jian (May 2024, IEEE 25th International Conference on Information Reuse and Integration)

Quantifying the uncertainty of machine learning models is a critical step toward reducing human verification effort, because it flags predictions with low confidence. This paper proposes a method for uncertainty quantification (UQ) of table structure recognition (TSR). The proposed UQ method builds on a mixture-of-experts approach termed Test-Time Augmentation (TTA). Our key idea is to enrich and diversify the table representations in order to spotlight the cells with high recognition uncertainty. To evaluate its effectiveness, we propose two heuristics that differentiate highly uncertain cells from normal cells: masking and cell complexity quantification. Masking varies the pixel intensity to gauge the detection uncertainty. Cell complexity quantification gauges the uncertainty of each cell from its topological relation with neighboring cells. Evaluation results on standard benchmark datasets demonstrate that the proposed method is effective in quantifying uncertainty in TSR models. To the best of our knowledge, this study is the first of its kind to enable UQ in TSR tasks.
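The general idea of test-time augmentation as an uncertainty signal can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: `tta_uncertainty`, `toy_predict`, and the intensity-scaling augmentations are hypothetical names invented for this example, and the "model" is a trivial threshold on mean pixel intensity. Uncertainty is taken to be the fraction of augmented views that disagree with the majority prediction.

```python
from collections import Counter


def tta_uncertainty(predict, sample, augmentations):
    """Run `predict` on each augmented view of `sample`.

    Returns (majority_label, uncertainty), where uncertainty is the share
    of augmented views whose prediction differs from the majority label.
    """
    votes = Counter(predict(aug(sample)) for aug in augmentations)
    label, count = votes.most_common(1)[0]
    return label, 1.0 - count / sum(votes.values())


def toy_predict(pixels):
    """Hypothetical stand-in for a recognizer: threshold on mean intensity."""
    return "cell" if sum(pixels) / len(pixels) > 0.5 else "background"


# Hypothetical augmentations: rescale pixel intensity, clipped to [0, 1].
augmentations = [
    lambda p: list(p),
    lambda p: [x * 0.8 for x in p],
    lambda p: [min(1.0, x * 1.2) for x in p],
    lambda p: [x * 0.6 for x in p],
]

confident_patch = [0.9, 0.95, 1.0]    # stays above the threshold in every view
borderline_patch = [0.55, 0.6, 0.5]   # flips label under darker views

confident_result = tta_uncertainty(toy_predict, confident_patch, augmentations)
borderline_result = tta_uncertainty(toy_predict, borderline_patch, augmentations)
```

A patch whose prediction survives every augmentation gets uncertainty 0, while one that flips under half of the views gets 0.5, which is the kind of cell a human reviewer would be asked to check.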
Full Text Available
Field and Laboratory Evidence That Chlorpyrifos Exposure Reduced the Population Density of a Freshwater Snail by Increasing Juvenile Mortality

https://doi.org/10.1021/acs.est.4c04202

Han, Guixin; Kong, Ren; Liu, Chunsheng; Huang, Kai; Xu, Qiaolin; Wu, Jian; Fei, Jiamin; Zhang, Hui; Su, Guanyong; Letcher, Robert J; et al (October 2024, Environmental Science & Technology)

Full Text Available
CMRM: A cross-modal reasoning model to enable zero-shot imitation learning for robotic RFID inventory in unstructured environments

Yongshuai Wu, Jian Zhang (December 2023, IEEE)

Full Text Available
Online Learning from Evolving Feature Spaces with Deep Variational Models

https://doi.org/10.1109/TKDE.2023.3326365

Lian, Heng; Wu, Di; Hou, Bo-Jian; Wu, Jian; He, Yi (January 2024, IEEE Transactions on Knowledge and Data Engineering)

In this paper, we explore a novel online learning setting, where the online learners are presented with “doubly-streaming” data. Namely, the data instances constantly streaming in are described by feature spaces that evolve over time, with new features emerging and old features fading away. The main challenge of this problem is that the newly emerging features are described by very few samples, resulting in weak learners that tend to make erroneous predictions. A seemingly plausible idea to overcome the challenge is to establish a relationship between the old and new feature spaces, so that an online learner can leverage the knowledge learned from the old features to improve the learning performance on the new features. Unfortunately, this idea does not scale up to high-dimensional feature spaces that entail very complex feature interplay. Specifically, a tradeoff between onlineness, which biases toward shallow learners, and expressiveness, which requires deep models, is inevitable. Motivated by this, we propose a novel paradigm, named Online Learning Deep models from Data of Double Streams (OLD3S), where a shared latent subspace is discovered to summarize information from the old and new feature spaces, building an intermediate feature mapping relationship. A key trait of OLD3S is to treat the model capacity as a learnable semantics, aiming to yield optimal model depth and parameters jointly, in accordance with the complexity and non-linearity of the input data streams, in an online fashion. To ablate its efficacy and applicability, two variants of OLD3S are proposed, namely OLD-Linear, which learns the relationship by a linear function, and OLD-FD, which learns the relationship between two consecutive feature spaces, before and after evolution, with a fixed network depth.
Besides, instead of restarting the entire learning process from scratch, OLD3S learns multiple newly emerging feature spaces in a lifelong manner, retaining the knowledge from the learned and vanished feature spaces to give the learning process on the new features a jump-start. Both theoretical analysis and empirical studies substantiate the viability and effectiveness of our proposed approach.
Full Text Available
MSVEC: A Multidomain Testing Dataset for Scientific Claim Verification

https://doi.org/10.1145/3565287.3617630

Evans, Michael; Soós, Dominik; Landers, Ethan; Wu, Jian (October 2023, ACM)

Full Text Available
Phase-field modeling of stochastic fracture in heterogeneous quasi-brittle solids

https://doi.org/10.1016/j.cma.2023.116332

Wu, Jian-Ying; Yao, Jing-Ru; Le, Jia-Liang (November 2023, Computer Methods in Applied Mechanics and Engineering)

Full Text Available
Scholarly big data quality assessment: a case study of document linking and conflation with S2ORC

https://doi.org/10.1145/3558100.3563850

Wu, Jian; Hiltabrand, Ryan; Soós, Dominik; Giles, C. Lee (September 2022, ACM Symposium on Document Engineering. (DocEng 2022))

Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets with more than 130 million scholarly paper records. S2ORC contains a significant portion of automatically generated metadata. The metadata quality could impact downstream tasks such as citation analysis, citation prediction, and link analysis. In this project, we assess the document linking quality and estimate the document conflation rate for the S2ORC dataset. Using semi-automatically curated ground truth corpora, we estimated that the overall document linking quality is high, with 92.6% of documents correctly linking to six major databases, but the linking quality varies depending on subject domains. The document conflation rate is around 2.6%, meaning that about 97.4% of documents are unique. We further quantitatively compared three near-duplicate detection methods using the ground truth created from S2ORC. The experiments indicated that locality-sensitive hashing was the best method in terms of effectiveness and scalability, achieving high performance (F1=0.960) and a much reduced runtime. Our code and data are available at https://github.com/lamps-lab/docconflation.
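To illustrate why locality-sensitive hashing scales well for near-duplicate detection, here is a minimal MinHash-with-banding sketch using only the Python standard library. It is an assumption-laden illustration, not the code from the repository linked above: the function names, the 3-word shingles, and the 32-hash, 8-band configuration are all hypothetical choices made for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_hashes=32):
    """For each seed, keep the minimum 64-bit hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def candidate_pairs(docs, num_hashes=32, bands=8):
    """Return pairs of doc ids whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            # Documents sharing a whole band of hash values land in one bucket.
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records: "a" and "b" are the same title, "c" is unrelated.
docs = {
    "a": "Scholarly big data quality assessment a case study of document linking",
    "b": "Scholarly big data quality assessment a case study of document linking",
    "c": "Phase-field modeling of stochastic fracture in heterogeneous quasi-brittle solids",
}
pairs = candidate_pairs(docs)
```

Only documents that share a bucket are compared, so the expensive pairwise check runs on a small candidate set instead of all pairs. More bands with fewer rows each make the filter more permissive; fewer bands with more rows make it stricter.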
Design Considerations for a Sustainable Scholarly Big Data Service

https://doi.org/10.1145/3574318.3574340

Wu, Jian; Rohatgi, Shaurya; Angadi, Manoj K.; Puranik, Kavya S.; Giles, C. Lee (December 2022, Forum for Information Retrieval Evaluation. (FIRE 2022))

The advancement of web programming techniques, such as Ajax and jQuery, and datastores, such as Apache Solr and Elasticsearch, has made it much easier to deploy small- to medium-scale web-based search engines. However, developing a sustainable search engine that supports scholarly big data services is still challenging, often because of limited human resources and financial support. Such scenarios are typical in academic settings or small businesses. Here, we showcase how four key design decisions were made by trading off competing factors such as performance, cost, and efficiency when developing the Next Generation CiteSeerX (NGX), the successor of CiteSeerX, a pioneering digital library search engine that has been serving academic communities for more than two decades. This work extends our previous work in Wu et al. (2021) and discusses design considerations of infrastructure, web applications, indexing, and document filtering. These design considerations can be generalized to other web-based search engines of a similar scale that are deployed in small business or academic settings with limited resources.
Full Text Available
